System and method for analyzing and correcting retail data

ABSTRACT

A computer system and method are disclosed that analyze and correct retail data. The system includes several client workstations and one or more servers coupled together over a network. A database stores various data used by the system. A business logic server uses competitive and complementary fusion to analyze and correct some of the data sources stored in the database server. The data fusion process itself is iterative, utilizing both competitive and complementary fusion methods. In competitive fusion, two or more data sources that provide overlapping attributes are compared against each other. More accurate/reliable sources are used to correct less accurate/reliable sources. In complementary fusion, relationships modeled where data sources overlap are projected to areas of the data framework in which fewer sources exist, enhancing the accuracy/reliability of those fewer sources even in the absence of the other sources upon which the models were based.

BACKGROUND

The present invention relates to computer software, and more particularly, but not exclusively, relates to systems and methods for analyzing and correcting retail data.

The measurement of sales in retail channels can be done via a variety of methods. Initially, sample-based audits of consumer purchases at check-out were extensively utilized, but they were costly and subject to significant potential inaccuracies. With the advent of, and accuracy improvements in, scanner-based point of sale (POS) data, tracking services such as those offered by Information Resources, Inc. (IRI), and A.C. Nielsen (ACN) are able to provide highly granular (in terms of item, venue, and time), highly accurate measurement of sales in several retail channels, including food/grocery, drug, mass merchandise, convenience, and military commissary. These POS-based offerings can be sample-based (i.e., rely on a statistically determined subset of the target population) or census-based (i.e., use all available data from all available venues).

While POS-based measurement offerings do an excellent job of reporting “what” sold, they provide little insight into “why” something sold, since they provide no consumer-level data. To fill this need, market research companies such as IRI and ACN have recruited national consumer panels in which panelists report their households' purchases on a regular basis. This longitudinal sample allows the development of much deeper consumer insights (e.g., brand switching, trial and repeat, etc.).

However, consumer panels are not without their problems. As with any sample-based survey, consumer panels are subject to two types of errors, i.e., sampling errors and biases, where the total error is given by the sum of squares: (Total Error)²=(Sampling Error)²+(Bias)².

Sampling errors are those errors attributable to the normal (random) variation that would be expected due to the fact that, by the very act of sampling, measurements are not being taken from the entire population. Sampling errors can be reduced by increasing the sample size, since the standard deviation of the sampling distribution (often referred to as the “standard error”) is inversely proportional to the square root of the sample size.

Biases are systematic errors that affect any sample taken by a particular sampling method. Because these errors are systematic, they are not affected by the size of the sample. Examples of panel biases include, but are not limited to:

-   Recruitment bias, in which households recruited to participate in the panel are not representative of the target population (e.g., the overall population of the United States);
-   Self-selection bias, in which households who choose to participate in the panel have slightly different buying habits than the average household (e.g., an orientation toward using promotions or adopting new products);
-   Panelist turnover bias, in which the reporting effectiveness (accuracy and consistency) of panelists may vary over the time period in which they participate in the panel;
-   Hereditary bias, in which individuals within a household share a tendency toward certain behaviors or medical conditions;
-   Compliance bias, in which certain purchases or purchase occasions are consistently underreported by panelists;
-   Item placement bias, in which panelists report products purchased that have not been accurately captured and/or classified in the hierarchy maintained by the data collector; and
-   Projection bias, in which the weighting or projection system cannot fully adjust all geo-demographics or is stressed by over- or under-sampled segments of the target population.

While both bias and sampling error are present in consumer panel data, for panels of a size significant enough to be of use in tracking consumer purchases (e.g., the IRI and ACN panels), the vast majority of the error that is present is due to bias. Further, since bias is unaffected by sample size, the negative impact of bias relative to the negative impact of sampling error worsens as the panel size increases.

The negative impact of bias is substantially larger than that of sampling error for most products. Increasing the size of the sample (i.e., the size of the panel) will reduce only the sampling error and may, in fact, worsen any bias that may be present. Given the sizes of today's consumer panels, there is limited advantage to be gained by increasing the size of the panel, since over 90% of the total error is often due to non-sampling errors (i.e., bias).
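
The relationship between panel size and the two error components can be illustrated numerically. The following is a minimal sketch in Python, not part of the disclosed system: the function names, the per-household standard deviation, and the bias value are hypothetical, and the share of error due to bias is measured here on the squared terms of the total-error relation given above, which is only one possible way to define it.

```python
import math

def total_error(sampling_sd: float, bias: float, n: int) -> float:
    """Total error from (Total Error)^2 = (Sampling Error)^2 + (Bias)^2.

    The sampling error is taken as the standard error sampling_sd / sqrt(n).
    """
    sampling_error = sampling_sd / math.sqrt(n)
    return math.sqrt(sampling_error ** 2 + bias ** 2)

def bias_share(sampling_sd: float, bias: float, n: int) -> float:
    """Fraction of the squared total error that is attributable to bias."""
    sampling_error = sampling_sd / math.sqrt(n)
    return bias ** 2 / (sampling_error ** 2 + bias ** 2)

# Hypothetical values: per-household standard deviation of 50, fixed bias of 2.
for panel_size in (1_000, 10_000, 100_000):
    print(panel_size,
          round(total_error(50.0, 2.0, panel_size), 3),
          round(bias_share(50.0, 2.0, panel_size), 3))
```

Under these assumed values the total error approaches the bias from above as the panel grows, so enlarging the panel yields diminishing returns once the bias term dominates.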

There has been little progress in the area of developing a systematic method of identifying and quantifying these biases. Further advancements are needed in this area.

Another area of concern in retail sales measurement is “coverage”. Coverage includes both the number of channels in which measurements are reported and the business usefulness of those measurements. While Information Resources, Inc.'s (IRI's) point-of-sale (POS) based services provide excellent coverage of the Food/Grocery, Drug, Mass (excluding WALMART®), Convenience, and Military channels, these channels may account for only 50% of a manufacturer's sales, and as little as 20% of its sales growth. Non-tracked growth channels (e.g., Club, Dollar, WALMART®) are thus becoming an increasingly important part of manufacturers' businesses while at the same time having little data available in the way of actionable sales measurement information. Further advancements are also needed in this area.

SUMMARY

One form of the present invention is a unique system for analyzing and correcting retail data.

Other forms include unique systems and methods to identify, quantify, and correct consumer panel biases. Yet another form includes unique systems and methods to model relationships where data sources overlap to project values in areas in which fewer sources exist.

Another form includes operating a computer system that has several client workstations and servers coupled together over a network. At least one server is a database server that stores sales data for various data sources, product identifier and attribute categorizations, calculated factors, and other data. External sources can be used to feed the data store on a scheduled or on-demand basis. At least one server contains business logic for analyzing and correcting some of the data sources stored in the database server. Some client workstations can be used to administer settings used in the process of analyzing and correcting the data sources. Other client workstations can be used to view the corrected and/or uncorrected data in a multi-dimensional format using a graphical user interface.

Another form includes providing a computer system that uses multiple data sources to support inferences that would not be feasible based upon any single data source when used alone. Sales are positioned along product, venue, and time dimension hierarchies. Characteristics of the data source determine the level of aggregation at which the data can be positioned in the framework. For example, POS data may be available weekly in a particular channel; however, direct store delivery (DSD) data may be available at a daily level, and still other measures may be available only at a monthly or quarterly level. The situation is similar along the product and venue dimensions, ranging from the specificity of the sale of a particular UPC-coded item at a particular store to the generality of total category sales within a channel (across all geographies).

Once this data framework is populated, the data fusion process itself is an iterative one, utilizing both competitive and complementary fusion methods. In “competitive fusion”, two or more data sources that provide overlapping measurements along at least one dimension are compared (“competed”) against each other at some level of aggregation along the product, venue, and time dimensions. More accurate/reliable sources are used to correct less accurate/reliable sources. In “complementary fusion”, relationships modeled where data sources overlap are projected to areas of the data framework in which fewer (or even a single) sources exist, enhancing the accuracy/reliability of those fewer (or single) sources even in domains where data from the other sources upon which the models were based do not exist. The process is iterative in that the competitive and complementary fusion methodologies can be repeated at varying levels of aggregation of the data framework.

Another form includes providing a method for identifying and quantifying biases in consumer panel data so that the inherent utility of the consumer panel data may be enhanced. This method is termed competitive fusion. At least two data sources are used, with at least one assumed to be more accurate than the other (e.g., scanner-based POS data and consumer panel purchase data). The data sources are aligned along a common framework (i.e., data model or hierarchy) along the dimensions of product (item), venue (channel and/or geography), and/or time, with aggregation along these dimensions as necessary. Attributes along which the framework may be characterized are then identified. The data sources are compared along these attributes, quantifying the impact of the attributes on the less-accurate data source.

After these biases have been identified and quantified, the usefulness of the consumer panel data may be enhanced. The effect of the biases may be corrected for via modeling; i.e., the raw data may be adjusted to reduce or eliminate the effect of the biases. Furthermore, as appropriate, panel management practices may be changed in order to remove or lessen the source of bias in the panel itself.

Yet another form of the present invention includes providing a method for using complementary fusion to “project” the results and relationships from the competitive fusion method onto consumer panel data in a channel with incomplete data or less data than desired (e.g., data from WALMART®) to help enhance the accuracy of the panel data source. At this point, competitive fusion may be used again in several possible ways and at several levels of aggregation along the venue, time, and/or product dimensions in order to develop independent estimates against which the complementary-fused estimate may be competed:

-   Publicly available data about the incomplete channel (e.g., channel reports, reported sales and financials, store databases, geo-demographics, etc.) may be used to develop an independent venue (channel) estimate.
-   Publicly available data about the category of interest (e.g., category studies, industry reports, reported sales/financials, etc.) may be used to develop an independent category estimate.
-   Private data from manufacturer-partners (e.g., shipment data, delivery data, retailer-supplied data, etc.) may be used to develop independent channel and category estimates. Due to the potentially sensitive nature of some of these data sources, this competitive fusion may be performed inside a manufacturer's facility as an auxiliary input to the baseline model.
-   Private data from retailer-partners within a Collaborative Retail Exchange may be used in some venues to develop independent channel and category estimates.

Yet other forms, embodiments, objects, advantages, benefits, features, and aspects of the present invention will become apparent from the detailed description and drawings contained herein.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a diagrammatic view of a computer system of one embodiment of the present invention.

FIG. 2 is a multi-dimensional diagram illustrating the data space used by the system of FIG. 1.

FIG. 3 is a block diagram illustrating selected data sources that are used by the system of FIG. 1.

FIG. 4 is a high-level process flow diagram for the system of FIG. 1.

FIG. 5A is a first part process flow diagram for the system of FIG. 1 demonstrating the stages involved in performing competitive and complementary fusion.

FIG. 5B is a second part process flow diagram for the system of FIG. 1 demonstrating the stages involved in performing competitive and complementary fusion.

FIG. 6A is a first part process flow diagram for the system of FIG. 1 demonstrating a preferred process for calculating and applying factors in competitive fusion.

FIG. 6B is a second part process flow diagram for the system of FIG. 1 demonstrating a preferred process for calculating and applying factors in competitive fusion.

FIG. 6C is a third part process flow diagram for the system of FIG. 1 demonstrating a preferred process for calculating and applying factors in competitive fusion.

FIG. 7A is a first part process flow diagram for the system of FIG. 1 demonstrating an alternate process for calculating and applying factors in competitive fusion.

FIG. 7B is a second part process flow diagram for the system of FIG. 1 demonstrating an alternate process for calculating and applying factors in competitive fusion.

FIG. 7C is a third part process flow diagram for the system of FIG. 1 demonstrating an alternate process for calculating and applying factors in competitive fusion.

FIG. 8 is a process flow diagram for the system of FIG. 1 demonstrating the stages involved in performing complementary fusion.

FIG. 9 is a process flow diagram for the system of FIG. 1 demonstrating the stages involved in iteratively performing competitive and complementary fusion steps.

FIG. 10 is a process flow diagram for the system of FIG. 1 demonstrating the stages involved in calculating blended factors where multiple factor measures are available for the same factor.

FIG. 11 is a data table illustrating hypothetical data elements stored in the database of FIG. 1 to be used in accordance with the procedure of FIG. 6.

FIG. 12 is a data table illustrating hypothetical data elements that are stored in the database of FIG. 1 and are adjusted according to factors for a first attribute in accordance with the procedure of FIG. 6.

FIG. 13 is a data table illustrating hypothetical data elements that are stored in the database of FIG. 1 and are adjusted according to factors for a second attribute in accordance with the procedure of FIG. 6.

FIG. 14 is a data table illustrating hypothetical data elements that are stored in the database of FIG. 1 and are adjusted according to factors for a third attribute in accordance with the procedure of FIG. 6.

FIG. 15 is a data table illustrating hypothetical data elements stored in the database of FIG. 1, with attribute summaries, and used in accordance with the procedure of FIG. 7.

FIG. 16 is a data table illustrating hypothetical data elements that are stored in the database of FIG. 1 and are adjusted according to factors for three attributes in accordance with the procedure of FIG. 7.

FIG. 17 is a data table illustrating hypothetical data elements by retailer that are stored in the database of FIG. 1 and used in accordance with the complementary fusion procedure of FIG. 8.

FIG. 18 is a data table illustrating hypothetical data elements by retailer that are stored in the database of FIG. 1, adjusted using complementary fusion according to the factors calculated in accordance with the procedure of FIG. 7, as described in the procedure of FIG. 8.

FIG. 19 is a data table illustrating hypothetical data elements by retailer that are stored in the database of FIG. 1 and are used to perform another iteration of competitive fusion, including calculating blended factors, as described in the procedures of FIG. 9 and FIG. 10.

FIG. 20 is a data table illustrating hypothetical data elements by retailer that are stored in the database of FIG. 1 and updated based upon the blended factor, as described in the procedures of FIG. 9 and FIG. 10.

FIG. 21 is a data table illustrating hypothetical real, original, and corrected values stored in the database of FIG. 1 to show how the competitive and complementary fusion process helped improve the data, as described in the procedures of FIG. 9.

FIG. 22 is a simulated screen of a user interface for one or more client workstations of FIG. 1 that allows a user to view the multi-dimensional elements in the database, as described in the procedures of FIG. 4 and FIG. 5.

DETAILED DESCRIPTION OF SELECTED EMBODIMENTS

For the purposes of promoting an understanding of the principles of the invention, reference will now be made to the embodiment illustrated in the drawings and specific language will be used to describe the same. It will nevertheless be understood that no limitation of the scope of the invention is thereby intended. Any alterations and further modifications in the described embodiments, and any further applications of the principles of the invention as described herein are contemplated as would normally occur to one skilled in the art to which the invention relates.

One embodiment of the present invention includes a unique system for identifying, quantifying, and correcting consumer panel biases, and then using overlapping areas of the data sources to project values in areas where fewer or less complete sources exist. FIG. 1 is a diagrammatic view of computer system 20 of one embodiment of the present invention. Computer system 20 includes computer network 22. Computer network 22 couples together a number of computers 21 over network pathways 23 a-e. More specifically, system 20 includes several servers, namely business logic server 24 and database server 25. System 20 also includes external data sources 26, which in various embodiments include other computers, files, electronic and/or paper data sources. External data sources 26 are optionally coupled to network 22 over pathway 23 f. System 20 also includes client workstations 30 a, 30 b, and 30 c (collectively client workstations 30). While computers 21 are each illustrated as being either a server or a client, it should be understood that any of computers 21 may be arranged to provide both a client and server functionality, solely a client functionality, or solely a server functionality. Furthermore, it should be understood that while five computers 21 are illustrated, more or fewer may be utilized in alternative embodiments.

Computers 21 include one or more processors or CPUs (50 a, 50 b, 50 c, 50 d, and 50 e, respectively) and one or more types of memory (52 a, 52 b, 52 c, 52 d, and 52 e, respectively). Each memory 52 a, 52 b, 52 c, 52 d, and 52 e includes a removable memory device. Each processor may be comprised of one or more components configured as a single unit. Alternatively, when of a multi-component form, a processor may have one or more components located remotely relative to the others. One or more components of each processor may be of the electronic variety defining digital circuitry, analog circuitry, or both.

In one embodiment, each processor is of a conventional, integrated circuit microprocessor arrangement, such as one or more PENTIUM III or PENTIUM 4 processors supplied by INTEL Corporation of 2200 Mission College Boulevard, Santa Clara, Calif. 95052, USA.

Each memory (removable or generic) is one form of computer-readable device. Each memory may include one or more types of solid-state electronic memory, magnetic memory, or optical memory, just to name a few. By way of non-limiting example, each memory may include solid-state electronic Random Access Memory (RAM), Sequentially Accessible Memory (SAM) (such as the First-In, First-Out (FIFO) variety or the Last-In, First-Out (LIFO) variety), Programmable Read-Only Memory (PROM), Erasable Programmable Read-Only Memory (EPROM), or Electrically Erasable Programmable Read-Only Memory (EEPROM); an optical disc memory (such as a DVD or CD ROM); a magnetically encoded hard disc, floppy disc, tape, or cartridge media; or a combination of any of these memory types. Also, each memory may be volatile, nonvolatile, or a hybrid combination of volatile and nonvolatile varieties.

Although not shown in FIG. 1 to preserve clarity, in one embodiment each computer 21 is coupled to a display. Computers 21 may be of the same type, or be a heterogeneous combination of different computing devices. Likewise, the displays may be of the same type, or a heterogeneous combination of different visual devices. Although again not shown to preserve clarity, each computer 21 may also include one or more operator input devices such as a keyboard, mouse, track ball, light pen, and/or microtelecommunicator, to name just a few representative examples. Also, besides a display, one or more other output devices may be included such as loudspeaker(s) and/or a printer. Various display and input device arrangements are possible.

Computer network 22 can be in the form of a wired or wireless Local Area Network (LAN), Metropolitan Area Network (MAN), Wide Area Network (WAN) such as the Internet, a combination of these, or such other network arrangement as would occur to those skilled in the art. The operating logic of system 20 can be embodied in signals transmitted over network 22, in programming instructions, dedicated hardware, or a combination of these. It should be understood that more or fewer computers 21 can be coupled together by computer network 22.

In one embodiment, system 20 operates at one or more physical locations where business logic server 24 is configured as a server that hosts and runs application business logic 33, database server 25 is configured as a database 34 that stores reference data 35 (e.g. product identifiers 36 a, attributes 36 b, and a dictionary 36 c), at least two retail data sources (such as point-of-sale and panel data) 38, calculated factors 39, and other data 40. In one embodiment, external data 26 is imported to database server 25 from a mainframe extract file that is generated on a periodic basis. Various other scenarios are also possible for using and importing external data to database server 25. In another embodiment, external data sources are not used. In one embodiment, database 34 of database server 25 is a relational database and/or a data warehouse. Alternatively or additionally, database 34 can be a series of files, a combination of database tables and external files, calls to external web or other services that return data, and various other arrangements for accessing data for use in a program as would occur to one of ordinary skill in the art. Client workstations 30 are configured for providing one or more user interfaces to allow a user to modify settings used by business logic 33 and/or to view the retail data sources 38 of database 34 in a multi-dimensional format. Typical applications of system 20 would include more or fewer client workstations of this type at one or more physical locations, but three have been illustrated in FIG. 1 to preserve clarity. Furthermore, although two servers are shown, it will be appreciated by those of ordinary skill in the art that the one or more features provided by business logic server 24 and database server 25 could be provided on the same computer or varying other arrangements of computers at one or more physical locations and still be within the spirit of the invention. Farms of dedicated servers could also be provided to support the specific features if desired.

FIG. 2 is a multi-dimensional cube 60 that illustrates a way of conceptually thinking about the elements stored in database 34 of system 20. Cube 60 contains three dimensions: complexity 62, sources 64, and aggregation 66. In one embodiment, at least part of the data in database 34 is categorized according to complexity 62, sources 64, and aggregation 66 axes of multi-dimensional cube 60 for analysis, viewing, and/or reporting. Cube 60 helps illustrate the concept that the aggregation dimension 66 is itself multi-dimensional, although dimensions other than those illustrated could be used. Examples of elements of the source dimension 64 include client (internal) data 65 a, scanning (point-of-sale) data 65 b, panel data 65 c, audit data 65 d, and other (external) data 65 e. Examples of elements of the aggregation dimension 66 include time 67 a, item (product) 67 b, channel (venue) 67 c, geography (venue) 67 d, and other 67 e, to name a few. Various dimensions of cube 60 are used in the competitive fusion and complementary fusion processes described herein.

FIG. 3 is a block diagram illustrating further examples of the one or more retail data sources (38 in FIG. 1 and 64 in FIG. 2) that can be used by the system of FIG. 1 in the competitive fusion and complementary fusion processes described herein. Point-of-sale data 70, consumer panel data 72, audit/survey data 74 including causal (promotional) data, shipment data 76 from anywhere in the supply chain, population census data 78 including geo-demographic data, store universe data 80, other data sources 82, and specialty panels 84 are examples of the types of data that can be used with system 20. The types of data that can be used with system 20 are not limited to traditional retailers. For example, data collected during any part of the supply chain could be used as a data source.

Referring also to FIG. 4, one embodiment for implementing system 20 is illustrated in flow chart form as procedure 150, which demonstrates a high-level process for the system of FIG. 1 and will be discussed in more detail below. FIG. 4 illustrates the high-level procedures for performing “competitive fusion” and “complementary fusion”. In “competitive fusion”, two or more data sources that provide overlapping measurements along at least one dimension are compared (“competed”) against each other at some level of aggregation along the product, venue, and/or time dimensions. More accurate/reliable sources are used to correct less accurate/reliable sources. In “complementary fusion”, relationships modeled where data sources overlap are projected to areas of the data framework in which fewer (or even a single) sources exist, enhancing the accuracy/reliability of those fewer (or single) sources even in domains where data from the other sources upon which the models were based do not exist. The process is iterative in that the competitive and complementary fusion methodologies can be repeated at varying levels of aggregation of the data framework.

In one form, procedure 150 is at least partially implemented in the operating logic of system 20. Procedure 150 begins with business logic server 24 identifying at least two data sources, with at least one data source being more accurate than another (stage 152). At least one data source (see e.g. 38 in FIG. 1 and 64 in FIG. 2) is used as the “reference” data source and another is used as the “target” data source with the biases to be identified and quantified. In one embodiment, the reference data source is more accurate than the target data source. For purposes of the tracking of sales in retail channels, scanner-based point-of-sale (POS) data is typically a good “reference” source, due to its inherent accuracy and high level of granularity along the dimensions of time, venue, and product. Alternatively or additionally, manufacturer-supplied shipment data, especially where such data is based upon direct store delivery (DSD) information, may be utilized as a “reference” source. As yet another alternative, retailer-specific data sources (e.g., “frequent shopper” program data from loyalty cards) are also appropriate.

Various examples herein illustrate using consumer panel purchase data as the target data source to be corrected. However, the current invention can be used with other data sources, such as sample-based or survey-based data sources whose overall accuracy is limited by the presence of biases, to name a few non-limiting examples.

The product characteristics of the data sources should ideally be available at the item level, where “item” is by UPC, SKU, or another unique product identifier. In terms of the venue characteristics of the data sources, they should ideally be available at the retailer and market level, where “retailer” is a store (or chain of stores) within a particular retail channel and “market” is a geographic construct (e.g., Chicago area). In terms of the time characteristics of the data sources, they should ideally be available at the weekly level (or even daily in some cases), although monthly data (or 4-week “quad” data) or various other time frames are also acceptable. Where these levels of granularity are not possible, more aggregated levels of the product (e.g., “brand”), venue (e.g., “food” or “mass” channel for retailer and/or “region” or “total U.S.” for market), and/or time (e.g., quarterly or annual data) dimensions may be used.

After the data sources have been identified (stage 152), they are next aligned along a common framework (stage 154), such as along the item, venue, and/or time dimensions. Depending upon the characteristics (and quality) of the data sources, some aggregation along these dimensions may be required in order for the alignment to be possible. For example, UPC-level POS data may need to be aggregated at the SKU or even brand level in order to be aligned with data from other sources (particularly in the cases in which venue-specific UPCs are involved). Similarly, store-level data may need to be aggregated at the local market or even regional level in order to be aligned with consumer panel purchase data. Finally, weekly (or even daily) POS data may need to be aggregated at the 4-week quad level in order to be aligned with shipment/delivery data. Various other arrangements for aligning the data along a common framework are also possible.
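
The aggregation described for stage 154 can be sketched as a roll-up of fine-grained records to a coarser common key. The following Python sketch is illustrative only: the mapping tables, the record layout, and the function names (`quad_of`, `align`) are hypothetical and are not taken from the disclosed system.

```python
from collections import defaultdict

# Hypothetical lookup tables; a real system would read these from reference data.
UPC_TO_BRAND = {"0001": "BrandA", "0002": "BrandA", "0003": "BrandB"}
STORE_TO_MARKET = {"S1": "Chicago", "S2": "Chicago", "S3": "Dallas"}

def quad_of(week: int) -> int:
    """Map a 1-based week number to a 1-based 4-week "quad" period."""
    return (week - 1) // 4 + 1

def align(records):
    """Roll (upc, store, week, volume) records up to (brand, market, quad) totals."""
    totals = defaultdict(float)
    for upc, store, week, volume in records:
        key = (UPC_TO_BRAND[upc], STORE_TO_MARKET[store], quad_of(week))
        totals[key] += volume
    return dict(totals)

# Hypothetical UPC-level, store-level, weekly POS records.
pos_records = [
    ("0001", "S1", 1, 10.0),
    ("0002", "S2", 3, 5.0),
    ("0001", "S1", 4, 2.0),
    ("0003", "S3", 5, 7.0),
]
aligned = align(pos_records)
print(aligned)
```

Once two sources have been rolled up to the same (brand, market, quad) keys, their values can be compared cell by cell. The choice of brand, market, and quad as the common level is an assumption for this example; the appropriate level depends on the coarsest source being aligned.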

In one embodiment, the item structure is provided by a multiple-level hierarchy, in which UPCs are the lowest level and are aggregated along category-related characteristics. Venue structure is provided along both geographical and channel dimensions, with FIPS-code-level transactions being aligned along market and regions and store locations being part of a sub-chain, chain, and parent store hierarchy. Time structure is presently provided at the weekly level at the lowest level of aggregation, with daily data being aggregated at the weekly level before placement into the structure, although a daily data compatible structure or other variation is also possible.

As a result of aligning the data sources along a common framework (stage 154), overlapping attribute segments of at least one dimension are available to use for data comparison and correction. Certain attributes associated with the data sources are identified along which more detailed comparisons may be made. In one embodiment, product attributes are available from reference data 35 of database 34. For example, one or more pieces of information from product identifier 36 a, attributes 36 b, and dictionary 36 c references can be used to access or modify attributes, attribute hierarchies, and mappings. These attributes represent category-specific dimensions along which products in that category may be characterized (e.g., diet vs. regular in carbonated soft drinks, active ingredient in internal analgesics, product size in most categories). The term attribute used herein is meant in the generic sense to cover various types of descriptors.

Business logic server 24 compares the data sources and calculates factors for the attributes of at least one element of the common framework (stage 158). Each segment of a given attribute will have its own factor, as described in detail herein. The presence of attribute-related bias may be identified by comparison of the data sources. In the examples illustrated herein, volumetric comparisons are made (e.g., equivalent units); however, various other measures (e.g., dollar sales, actual units) could also be utilized, as long as the same type of measure is being used for the comparison. For example, it would not be useful to compare dollar sales to actual units, but it would be useful to compare dollars to dollars. The comparison itself is between the value of the target data source (e.g., projected panel volume) and that of the reference data source (e.g., POS data). This comparison can be by way of two-sample inference, regression analysis, or other statistical tests appropriate for determining whether any differences between the two data sources are associated with the attributes along which they have been characterized at a statistically significant level. Where such differences (biases) are identified, they are quantified, and factors are calculated for use in bias correction/adjustment.

The factors are used to correct bias in the less accurate data source (stage 160), which in this example is consumer panel data. By using the factors to correct the bias in the less accurate “target” data source, the effect of these biases is reduced or eliminated. These biases can be corrected by adjusting the raw data, or by way of post-adjustment.

In “complementary fusion”, the factors are also used to supplement the data that is incomplete in the less complete data source (stage 162), such as consumer panel data. Incomplete data is used in a general sense to mean that less data was provided than desired or that the data is less accurate than desired, to name a few non-limiting examples. Where highly accurate data (e.g. POS data) is not provided, less accurate data (e.g. panel data) becomes more important to analyze and correct. Relationships modeled where data sources overlap are projected to areas of the data framework in which fewer (or even a single) sources exist, enhancing the accuracy and reliability of those fewer (or single) sources even in domains where data from the other sources upon which the models were based do not exist.

Users and/or reports can access database 34 from one of client workstations 30 to view/analyze the corrected and adjusted data (stage 164). Users and/or reports can also access database 34 from one of client workstations 30 to view and/or modify settings used by system 20 to make data corrections. The steps are repeated as desired (stage 166). The process then ends at stage 168.

FIGS. 5A-5B are first and second parts of a process flow diagram for the system of FIG. 1 demonstrating the stages involved in performing competitive and complementary fusion using POS and panel data as the data sources. While in this and other figures, the first data source (the “source” data source) is described as being POS data and the second data source (the “target” data source) is described as being panel data, it will be appreciated that the system and methodologies can be used with other data sources as appropriate. In one form, procedure 170 is at least partially implemented in the operating logic of system 20. Procedure 170 begins in FIG. 5A with receiving updates for reference data 35 and/or data sources 38 on a periodic basis (stage 172).

In one embodiment, a parameter specification for the number of weeks used in calculating the factors is thirteen, and the minimum week range included in database 34 is then set to be thirteen weeks prior to the update week. Database 34 may be built and maintained using various data sources and can include various types of data, as would occur to one of ordinary skill in the art. In one embodiment, system 20 supports the option to pull the desired period (e.g. all thirteen weeks) of the data sources 38, append the recent period (e.g. four weeks) needed since the last factor update to the existing database 34, and/or be able to recreate the data a week at a time. In such a scenario, for space conservation, the system can optionally drop the same number of weeks from the start week of database 34 as were appended to the end week. For example, if the option was chosen to append the four weeks needed since the last factor update, the system should drop the four oldest weeks from the existing database 34 when appending the four new weeks.
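The append-and-drop option can be pictured as a fixed-length rolling window. The following is only a sketch under the assumption that the stored weeks are held as an ordered list of week labels; the function name and the `drop_oldest` flag are hypothetical.

```python
def append_and_trim(existing_weeks, new_weeks, drop_oldest=True):
    """Append newly received weeks and optionally drop as many of the oldest.

    existing_weeks, new_weeks: lists of week labels ordered oldest to newest.
    With drop_oldest=True the window length stays constant, which mirrors
    the optional space-conservation behaviour described in the text.
    """
    combined = list(existing_weeks) + list(new_weeks)
    if drop_oldest:
        # Drop exactly as many weeks from the start as were appended at the end.
        combined = combined[len(new_weeks):]
    return combined
```

For a thirteen-week database and a four-week update, the result still holds thirteen weeks, with the four oldest replaced by the four newest.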

The received updates to reference data 35 and/or data sources 38 are stored in database 34 (stage 174). At some point in time, such as on a scheduled or as-requested basis, the system determines that data adjustments should be made to correct bias (decision point 175). Application business logic 33 ensures reference data 35 and data sources 38 are up to date, and if not, updates them accordingly (stage 176). Optionally, reference data 35 is reviewed to ensure that the default attributes for the current category will be appropriate for the client or scenario, and adjustments are made to reference data 35 as appropriate (stage 177). As one non-limiting example, attribute segments may be reviewed and translated to more succinct segmentations that better classify the product identifiers. Other variations are also possible.

A product-identifier-to-attribute-segment mapping is prepared for the product identifiers (e.g. UPC's) (stage 178). If the attributes are determined to be irrelevant, they can be removed from further consideration in this process. The attribute table 36 b is a reference table that maps each product identifier 36 a to a set of attribute variables. While UPC's are described as a common product identifier, other identifiers could also be used. For example, not every dataset has a UPC, but may have a product identifier at a higher, lower, or equivalent level. Rules are used to determine supportable attribute segments and relevant attributes. In one embodiment, if segment assignment is missing then the UPC is assigned to a new segment “not supportable.” All segments with less than a 5% share are assigned to “not supportable.” Furthermore, in one embodiment, if the final “not supportable” category accounts for >50% of the category share, then the attribute is designated as “irrelevant.” Other ways for determining relevance can also be used, or relevance can simply be ignored. Stage 178 can be repeated to arrive at the final level of segments to use (rolled-up or drilled-down) as appropriate.
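The supportability and relevance rules of this embodiment can be sketched as follows. The text does not say which volume the shares are based on, so the sketch assumes a single volume per UPC supplied by the caller; the function name, argument layout, and the use of `None` for a missing assignment are likewise assumptions.

```python
NOT_SUPPORTABLE = "not supportable"


def classify_segments(upc_segment, upc_volume, min_share=0.05, max_unsupported=0.50):
    """Apply the 5% / 50% rules for one attribute.

    upc_segment: dict of UPC -> segment name, or None when the assignment is missing.
    upc_volume:  dict of UPC -> volume used to compute shares (assumed basis).
    Returns (final UPC -> segment mapping, True if the attribute is irrelevant).
    """
    # Rule 1: a missing segment assignment goes to "not supportable".
    assigned = {upc: seg if seg is not None else NOT_SUPPORTABLE
                for upc, seg in upc_segment.items()}

    total = sum(upc_volume[upc] for upc in assigned)
    segment_volume = {}
    for upc, seg in assigned.items():
        segment_volume[seg] = segment_volume.get(seg, 0.0) + upc_volume[upc]

    # Rule 2: segments with less than a 5% share are folded into "not supportable".
    small = {seg for seg, vol in segment_volume.items() if vol / total < min_share}
    final = {upc: NOT_SUPPORTABLE if seg in small else seg
             for upc, seg in assigned.items()}

    # Rule 3: the attribute is irrelevant if "not supportable" exceeds 50% of the category.
    unsupported = sum(upc_volume[upc] for upc, seg in final.items()
                      if seg == NOT_SUPPORTABLE)
    return final, unsupported / total > max_unsupported
```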

Continuing with FIG. 5B, source (e.g. POS) and target (e.g. panel) data 38 are retrieved from database 34 and summarized by attribute segments (stage 180). Factors are calculated for attribute segments (stage 181). The significance of the attribute segments is determined (stage 182). If any non-significant factors are determined, the significant attribute factors can be re-aligned (stage 183). The factors for each attribute segment are applied to the target (panel) data to correct bias (stage 184). The factors are also applied to the target (panel) data to correct data that is incomplete (e.g. less available) (stage 186). The competitive and/or complementary data fusion steps can be repeated as desired or appropriate (stage 187). Users and/or reports can access database 34 from one of client workstations 30 to view/analyze the corrected and adjusted data (stage 188). The procedure 170 then ends at stage 190. FIGS. 6-10 illustrate the competitive and complementary fusion stages in further detail.

FIGS. 6A-6C are first, second, and third parts of a process flow diagram for the system of FIG. 1 demonstrating a preferred process for iteratively calculating and applying factors in competitive fusion. In one form, procedure 200 is at least partially implemented in the operating logic of system 20. Procedure 200 begins on FIG. 6A with summing source (POS) data by the most granular product and time dimension (e.g. UPC) (stage 202) and summing target (panel) data by the most granular product and time dimension (e.g. UPC) (stage 204). In one embodiment, they are both summed to weekly (e.g. 52) totals. Business logic server 24 determines the period of time to use in the analysis (stage 206), such as to use all of the weekly totals summed in the prior step or to use only part of the weekly totals that cover a desired time period, such as the most recent 13 weeks, to name a few examples. Outliers are also eliminated (stage 207) at this point or another appropriate point before final calculations. For example, in one embodiment, although thirteen weeks are contained in the dataset, only eleven weeks are actually used in calculations. Research indicates that panel volume is extremely vulnerable to outliers. To minimize the potential impact of outliers, the week with the lowest coverage and the week with the highest coverage are eliminated from further use in calculations for the current update. In one embodiment, although the outlier weeks are eliminated from further use in calculations for the current update, they are not removed from the dataset as they may be used in subsequent updates. Business logic server 24 then merges the source (POS) data, target (panel) data, and product identifier to attribute segment mapping reference data (stage 208). Attributes can optionally be sorted in order by importance (stage 210). In one embodiment, the least important is first and the most important is last. If the factors for the most important attribute segments are the last ones applied, they usually have the most significant mathematical effect because no factor for a less important attribute segment will be applied after that last calculation to further skew the results.
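The outlier step (stage 207) can be sketched as below. The text does not define "coverage", so this sketch assumes it is the weekly ratio of panel volume to POS volume; that definition, the function name, and the dict layout are assumptions. The input data is left untouched, consistent with the outlier weeks remaining in the dataset for later updates.

```python
def weeks_for_calculation(pos_by_week, panel_by_week):
    """Return the weeks to use after excluding the lowest- and highest-coverage weeks.

    pos_by_week, panel_by_week: dicts of week label -> total volume.
    Coverage is assumed here to be panel volume divided by POS volume.
    """
    coverage = {week: panel_by_week[week] / pos_by_week[week] for week in pos_by_week}
    lowest = min(coverage, key=coverage.get)
    highest = max(coverage, key=coverage.get)
    # Only the returned list is filtered; the source dicts keep every week.
    return [week for week in sorted(coverage) if week not in (lowest, highest)]
```

With thirteen weeks of input, eleven weeks remain for the factor calculation, provided the lowest and highest coverage fall in different weeks.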

An initial factor of 1.0 is assigned to each attribute segment (stage 212). Continuing with FIG. 6B, source (POS) and target (panel) data are then summarized for the segments of the current attribute (stage 214). A factor is calculated for each attribute segment of the current attribute as source data volume divided by target data volume (stage 216). Other mathematical variations could also be used. For each segment of the current attribute, business logic server 24 determines whether the attribute segment is significant (stage 218). In one embodiment, shares are calculated for the attribute segments, such as by dividing the Calculation Period Segment Total U.S. POS volume by the Calculation Period Category Total U.S. POS volume. Significance is then determined by first analyzing a confidence interval (CI) for each share to determine if there is overlap between the POS share CI and the panel share CI. If there is overlap, then the difference between source and target shares is not significant and the attribute segment will be designated as “nonsignificant.” Other ways for determining significance can also be used, or significance can be assumed.
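A hedged sketch of stages 216 and 218 follows. The factor is the ratio stated in the text. The text does not say how the confidence intervals are constructed, so `share_ci` uses a normal-approximation interval for a proportion with a caller-supplied effective sample size `n`; that construction, the parameter `n`, and all function names are assumptions.

```python
import math


def segment_factor(source_volume, target_volume):
    """Factor for one attribute segment: source (POS) volume over target (panel) volume."""
    return source_volume / target_volume


def share_ci(segment_volume, category_volume, n, z=1.96):
    """Approximate CI for a segment share (assumed normal approximation, sample size n)."""
    share = segment_volume / category_volume
    half_width = z * math.sqrt(share * (1.0 - share) / n)
    return share - half_width, share + half_width


def is_significant(pos_ci, panel_ci):
    """A segment is significant only when the two share intervals do not overlap."""
    overlap = pos_ci[0] <= panel_ci[1] and panel_ci[0] <= pos_ci[1]
    return not overlap
```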

In one embodiment, if two or more segments for the current attribute are nonsignificant (stage 220), then the significant factors (that remain) will need to be re-aligned to account for non-significant segment factors being removed (stage 222). At the product identifier-level target (panel) data, each volume is multiplied by the factor for the corresponding segment (stage 224). Again, other mathematical variations could also be used. The factors for each attribute segment are then saved to factor data store 39 of database 34 (stage 226). If another attribute is present (decision point 228), the next attribute is made the current attribute (stage 230) and stages 214-226 are repeated. These stages are repeated until all attributes are processed. Continuing with FIG. 6C, a category adjustment factor is applied to all product identifiers as necessary (stage 232) to adjust for the level of coverage. In one embodiment, the use of a category adjustment factor depends on the type of measure being used. For example, where volume is used, coverage adjustments may not be necessary, but where shares are used, further coverage adjustments may be necessary. Any final factors for the category adjustment factor are saved to the factor data store 39 of database 34 (stage 234). The process 200 then ends at stage 236.
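The iterative loop of stages 214-230 can be sketched as follows. For brevity the sketch assumes every segment is significant and omits re-alignment (stage 222) and the category adjustment factor (stage 232). The row layout (dicts with `pos`, `panel`, and one key per attribute) and the function name are hypothetical.

```python
def iterative_factors(rows, attributes):
    """Calculate and apply factors one attribute at a time.

    rows: list of dicts with 'pos', 'panel', and one key per attribute name.
    attributes: attribute names ordered least important first.
    Returns (adjusted panel volumes, factors saved per attribute).
    """
    adjusted = [row["panel"] for row in rows]
    saved = {}
    for attr in attributes:
        pos_total, panel_total = {}, {}
        # Stage 214: summarize by segment, using the panel volumes adjusted so far.
        for row, volume in zip(rows, adjusted):
            seg = row[attr]
            pos_total[seg] = pos_total.get(seg, 0.0) + row["pos"]
            panel_total[seg] = panel_total.get(seg, 0.0) + volume
        # Stage 216: factor = source volume / target volume for each segment.
        factors = {seg: pos_total[seg] / panel_total[seg] for seg in pos_total}
        # Stage 224: multiply each product-level panel volume by its segment factor.
        adjusted = [volume * factors[row[attr]] for row, volume in zip(rows, adjusted)]
        saved[attr] = factors  # Stage 226: keep the factors for later reuse.
    return adjusted, saved
```

Because each pass starts from the previously adjusted volumes, the segment totals of the last attribute processed match the POS totals exactly, which is one way to see why the most important attribute is placed last.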

FIGS. 7A-7C are first, second, and third parts of a process flow diagram for the system of FIG. 1 demonstrating an alternate process for calculating and applying factors in competitive fusion. In one form, procedure 250 is at least partially implemented in the operating logic of system 20. Procedure 250 begins on FIG. 7A with summing the more reliable (source) data source (e.g., POS data) by the most granular product and time dimension (e.g. UPC) (stage 252) and summing the less accurate (target) data source (e.g., panel data) by the most granular product and time dimension (stage 254). Business logic server 24 determines the period of time to use in the analysis (stage 256) and eliminates outliers (stage 257), as discussed in FIG. 6. Source data, target data, and product identifier to attribute segment mapping data are merged (stage 258). An initial factor of 1.0 is assigned to each attribute segment (stage 260). Source and target data are summarized to the segments for all attributes (stage 262).

Continuing with FIG. 7B, factors are calculated for each attribute segment as source volume divided by target volume (stage 264). Business logic server 24 determines whether the attribute segment is significant (stage 266), as described in FIG. 6. Where two or more segments for any particular attribute are insignificant (decision point 268), then the significant factors are re-aligned to account for the elimination of the insignificant segment factors in the particular attribute (stage 270). At the product identifier-level target data, each volume is multiplied by the factor for each corresponding segment (stage 272). In other words, all of the factors applicable to the volume are applied simultaneously, as opposed to iteratively as shown in FIG. 6. The factors are then saved to factor data store 39 for each attribute segment (stage 274).

Continuing with FIG. 7C, a category adjustment factor is applied to all product identifiers as necessary (stage 276), as described in FIG. 6. The final factors for the category adjustment factor are saved to the factor data store 39 of database 34 (stage 277). The procedure 250 then ends at stage 278. Procedure 250 should only be used in the appropriate circumstances, such as when the attributes are not affected by each other and iteration is not needed for greater accuracy, to name one example. If attributes are affected by each other and procedure 250 is used instead of the iterative procedure of FIG. 6, then the results will be mathematically different, with the procedure of FIG. 6 producing a more accurate result.
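For contrast with the iterative procedure, the simultaneous calculation of stages 262-272 can be sketched as below. Every factor is computed from the unadjusted data and all applicable factors are multiplied together in one pass. Significance testing, re-alignment, and the category adjustment are omitted, and the row layout and function name are hypothetical.

```python
def simultaneous_factors(rows, attributes):
    """Calculate all attribute-segment factors from raw data, then apply them together.

    rows: list of dicts with 'pos', 'panel', and one key per attribute name.
    Returns (adjusted panel volumes, factors per attribute).
    """
    factors = {}
    for attr in attributes:
        pos_total, panel_total = {}, {}
        # Stage 262: summarize raw source and target data by segment.
        for row in rows:
            seg = row[attr]
            pos_total[seg] = pos_total.get(seg, 0.0) + row["pos"]
            panel_total[seg] = panel_total.get(seg, 0.0) + row["panel"]
        # Stage 264: factor = source volume / target volume.
        factors[attr] = {seg: pos_total[seg] / panel_total[seg] for seg in pos_total}

    adjusted = []
    for row in rows:
        volume = row["panel"]
        # Stage 272: apply every applicable factor to the same raw volume.
        for attr in attributes:
            volume *= factors[attr][row[attr]]
        adjusted.append(volume)
    return adjusted, factors
```

When attributes interact, the adjusted segment totals produced this way generally do not reproduce the POS totals for any single attribute, which illustrates why the results differ from those of the iterative procedure.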

FIG. 8 is a process flow diagram for the system of FIG. 1 demonstrating the stages involved in performing complementary fusion. In one form, procedure 280 is at least partially implemented in the operating logic of system 20. Procedure 280 begins with merging source data, target data, and product identifier data to attribute segment mapping data (stage 282). The factors previously calculated in accordance with FIG. 6 or FIG. 7 are applied to the product identifier-level target data based on the attribute segment mapping to correct the data for incompleteness (e.g. less data than desired) (stage 286). The target data elements that are corrected in this process can be the same, different, or overlapping from the target data that was used to help calculate the factors. The procedure 280 then ends at stage 288.
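Stage 286 amounts to reusing stored factors on target rows for which no source data was available. The sketch below assumes the factors are held as a nested dict (attribute, then segment) and that a segment with no stored factor falls back to 1.0; both choices and the function name are assumptions for illustration.

```python
def apply_saved_factors(panel_rows, factors_by_attribute):
    """Apply previously calculated factors to product-level target (panel) rows.

    panel_rows: list of dicts with 'panel' and one key per attribute name.
    factors_by_attribute: dict of attribute -> {segment -> factor}.
    Segments without a stored factor are left unadjusted (assumed default of 1.0).
    """
    adjusted = []
    for row in panel_rows:
        volume = row["panel"]
        for attr, factors in factors_by_attribute.items():
            volume *= factors.get(row[attr], 1.0)
        adjusted.append(volume)
    return adjusted
```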

FIG. 9 is a process flow diagram for the system of FIG. 1 demonstrating the stages involved in repeating the competitive and complementary fusion steps multiple times. In one form, procedure 290 is at least partially implemented in the operating logic of system 20. Procedure 290 begins with determining what additional public or private data sources are available to use for competitive fusion along venue, time, and/or product dimensions (stage 292). Using one or more of those data sources, additional factors are calculated that are independent estimates against which the complementary-fused estimate may be competed (stage 294). The newly calculated factors are applied to the product identifier-level target data (e.g. panel data) to further adjust the data (stage 296). The competitive and complementary fusion steps can be repeated as desired and/or appropriate (stage 298). The procedure 290 then ends at stage 299.

FIG. 10 is a process flow diagram for the system of FIG. 1 demonstrating the stages involved in calculating blended factors where multiple factor measures are available for the same factor. In one form, procedure 300 is at least partially implemented in the operating logic of system 20. Procedure 300 can be used when competitive fusion is being performed and at least two data sources are available for the same factor (stage 302). For each aggregation (venue, time, or product) that has at least two factor measures, specific totals are calculated across attributes (stage 304). Factors for each aggregation of the current data source are calculated by dividing source data volume by target data volume (stage 305). If there are more data sources (decision point 306), then the process moves to the next data source (stage 307) and repeats stages 304-305. Then, a blended factor is calculated (stage 308) where the more accurate source is given a higher weight and the less accurate source is given a lower weight. One simple way of calculating a blended factor is to calculate a central tendency (e.g., mean or median) of the various factors as the overall factor. This treats all estimates as of equal value (reliability, accuracy, precision), which in reality may or may not be the case. In a preferred embodiment, the “blended factor” uses an “inverse-variance-weighted” method (see 444 on FIG. 19 as an example). This name originates from the fact that more “reliable” estimates, i.e., those with more precision and, thus, less variability, are given more weight than those that are less “reliable” (more variable). Once the blended estimate has been calculated, each volume of the product identifier-level target data is multiplied by the blended factor (stage 310). The procedure 300 then ends at stage 312.
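The inverse-variance-weighted blend of stage 308 can be sketched as a weighted mean in which each factor's weight is the reciprocal of its variance. How the variances are estimated is not specified in the text, so they are taken here as inputs; the function name is hypothetical.

```python
def blended_factor(factors, variances):
    """Inverse-variance-weighted mean of several factor estimates.

    factors:   factor estimates for the same aggregation, one per data source.
    variances: variance of each estimate (assumed to be supplied by the caller).
    Lower-variance (more reliable) estimates receive proportionally more weight.
    """
    weights = [1.0 / variance for variance in variances]
    weighted_sum = sum(weight * factor for weight, factor in zip(weights, factors))
    return weighted_sum / sum(weights)
```

When all variances are equal, the result reduces to the simple mean mentioned in the text.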

A hypothetical example will now be described in FIGS. 11-21 with reference to the procedures described in FIGS. 6-10. FIG. 11 is a data table illustrating hypothetical data elements that are adjusted according to the preferred embodiment competitive fusion procedure of FIG. 6. POS data 320, panel data 322, and attribute information 324 are shown in a summarized form by UPC 326. For each attribute and its corresponding segments, various steps are performed as discussed below.

Turning to FIG. 12, the data is assumed to be relevant and the POS and panel data shown in table 330 are then summarized for the segments of the current attribute (stage 214), which in the current iteration is manufacturer 332. Private brand label summaries 334 and non-private brand label summaries 336 for POS 338 and panel data 340 are calculated from table 330 as illustrated. A factor 342 for each attribute segment of the current attribute, in this case private label manufacturer 334 and non-private label manufacturer 336 segments, is calculated as POS volume 338 divided by panel volume 340 (stage 216). Business logic server 24 determines whether the current attribute segment is significant (stage 218). For purposes of illustrating the current example, all attribute segments are also assumed significant. At the UPC level panel data, each panel volume 344 is multiplied by the factor 342 for its corresponding segment (stage 224) to arrive at an adjusted panel value 346. Factors 342 are saved to the factor data store 39 of database 34 (stage 226).

As shown in FIGS. 13 and 14, stages 214 to 226 repeat for each attribute, with previously adjusted data being used in the calculation. FIG. 13 illustrates data elements being adjusted according to factors calculated for a second attribute in accordance with the procedure of FIG. 6. The POS and panel data shown in table 350 are then summarized for the segments of the current attribute (stage 214), which in the current iteration is type 352. Summaries for regular type 354 and special type 356 for POS 358 and panel data 360 are calculated from table 350 as illustrated. A factor 362 for each attribute segment of the current attribute, in this case regular type 354 and special type 356 segments, is calculated as POS volume 358 divided by panel volume 360 (stage 216). At the UPC level panel data, the previously adjusted panel volume 364 is multiplied by the factor 362 for its corresponding segment (stage 224) to arrive at yet another adjusted panel value 366. Factors 362 are saved to the factor data store 39 of database 34 (stage 226).

FIG. 14 illustrates data elements being adjusted according to factors calculated for a third attribute in accordance with the procedure of FIG. 6. The POS and panel data shown in table 370 are then summarized for the segments of the current attribute (stage 214), which in the current iteration is size 372. Summaries for size big 374, size medium 375, and size small 376 for POS 378 and panel data 380 are calculated from table 370 as illustrated. A factor 382 for each attribute segment of the current attribute, in this case size big 374, medium 375, and small 376 segments, is calculated as POS volume 378 divided by panel volume 380 (stage 216). At the UPC level panel data, each previously adjusted panel volume 384 is multiplied by the factor 382 for its corresponding segment (stage 224) to arrive at yet another adjusted panel value 386. Factors 382 are saved to the factor data store 39 of database 34 (stage 226). After processing all attributes, the final factors are saved to the factor data store 39 of database 34 (stage 234). The process then ends at stage 236.

FIGS. 15 and 16 illustrate data elements being adjusted according to factors calculated according to an alternative embodiment competitive fusion process in accordance with the procedure of FIG. 7. Business logic server 24 determines the period of time to use in the analysis (stage 256), and merges POS, panel, and attribute information by UPC as shown in table 390 (stage 258). POS data 392 and panel data 394 are summarized for all attribute segments (stage 262), in this case by manufacturer 396, type 398, and size 400. As shown in FIG. 16, factors for each attribute segment 402 are calculated as each respective POS volume 404 divided by each respective panel volume 406 (stage 264). Each panel volume 407 is multiplied by the factors 408 a-408 c appropriate for its corresponding segment (stage 272) to calculate an adjusted panel value 410. The process then ends at stage 278.

FIG. 17 is a data table illustrating hypothetical data elements by retailer that are stored in the database of FIG. 1 and used in accordance with the complementary fusion procedure of FIG. 8. POS, panel, and attribute information are merged by UPC (stage 282) for multiple retailers, as shown in table 420. Client shipment data 424, another data source available, is also merged by UPC. Shares are calculated for POS data 420 a-420 b and panel data 422 a-422 c for the segments of each attribute (stage 284). As shown in FIG. 18, the previously calculated factors 430 a-430 c (408 a-408 c in FIG. 16) are applied to the UPC level panel data 432 a-432 c to further adjust the data to correct for incompleteness (stage 286) and arrive at an adjusted panel value 434 a-434 c. The complementary fusion process then ends at stage 288.

FIGS. 19 and 20 illustrate performing another iteration of competitive fusion, including calculating blended factors, as described in the procedures of FIG. 9 and FIG. 10. Additional public or private data sources are identified as available to use for competitive fusion (stage 292). As shown in table 438, channel specific totals 440 a-440 f across attributes have been identified for use in competitive fusion. In addition to POS and panel totals for retailers 1 and 2 (440 a-440 d), client shipment total 440 e and panel total 440 f can also be used for comparison. Using these totals 440 a-440 f, additional factors 442 have been calculated that are independent estimates against which the complementary-fused data from FIG. 18 may be competed (stage 294). A blended factor 444 has been calculated since multiple data sources were available for the same factor (stages 302-308 in FIG. 10). As shown in FIGS. 19 and 20, each volume 446 a-446 c of the previously adjusted UPC-level panel data is then multiplied by the blended factor to arrive at the newly adjusted panel values 450 a-450 c (stage 298 in FIG. 9, and stage 310 in FIG. 10).

FIG. 21 is a data table illustrating hypothetical table 460 of end results for POS data elements by retailers 2 and 3, with a comparison to reality figures 462 a-462 b, pre-fusion figures 464 a-464 b, and post-fusion figures 466 a-466 b to show how the competitive and complementary fusion processes according to FIGS. 4-10 and illustrated in the hypothetical of FIGS. 11-20 helped improve the data accuracy.

FIG. 22 is a simulated screen of a user interface for one or more client workstations 30 that allows a user to view the multi-dimensional elements in the database, as described in the procedures of FIG. 4 and FIG. 5.

Alternatively or additionally, once data fusion has been performed as described herein, the updated data can be used by various systems, users, and/or reports as appropriate.

In one embodiment of the present invention, a method is disclosed comprising identifying a plurality of data sources, wherein at least a first data source is more accurate than a second data source; identifying a plurality of overlapping attribute segments to use for comparing the data sources; calculating a factor as a function of each of the plurality of overlapping attribute segments; and using the factors to update a first group of values in the second data source to reduce bias.

In another embodiment of the present invention, a method is disclosed comprising receiving point-of-sale data and panel data on a periodic basis; identifying a plurality of product identifiers and a plurality of attributes to analyze; retrieving and summarizing the point-of-sale data and the panel data by the plurality of product identifiers, the plurality of attributes, and a plurality of corresponding attribute segments for a specified time period; calculating a factor for each attribute segment of a particular attribute; and applying the factors for the particular attribute segment to the panel data to correct panel bias.

In yet another embodiment, a method is disclosed comprising receiving point-of-sale data and panel data on a periodic basis; identifying a plurality of product identifiers and a plurality of attributes to analyze; retrieving and summarizing the point-of-sale data and the panel data by the plurality of product identifiers, the plurality of attributes, and a plurality of corresponding attribute segments for a specified time period; calculating a plurality of factors, wherein one factor is calculated for each attribute segment of the plurality of attributes; applying the factors to the second data source to reduce bias; and applying the factors to the second data source to reduce incompleteness.

In yet a further embodiment, a method is disclosed comprising identifying a plurality of product identifiers and a plurality of attributes to analyze for at least two data sources, wherein at least a first data source is more accurate than a second data source; retrieving and summarizing the first data source and the second data source by the plurality of product identifiers, the plurality of attributes, and a plurality of corresponding attribute segments for a specified time period; calculating a plurality of factors, wherein one factor is calculated for each attribute segment of the plurality of attributes; applying the factors to the second data source to reduce bias; and applying the factors to a different or overlapping dataset of the second data source to reduce incompleteness.

In another embodiment, a system is disclosed that comprises one or more servers being operable to store retail data from at least two data sources, store product identifier and attribute categorizations, and store a plurality of factor calculations; wherein the at least two data sources includes a first data source that is more accurate than a second data source; and wherein one or more of said servers contains business logic that is operable to identify and retrieve a plurality of overlapping attribute segments to use for comparing the at least two data sources, compare each of the overlapping attribute segments, calculate a factor for each of the overlapping attribute segments, and use the factors to update a first group of values in the second data source to reduce bias.

In yet a further embodiment, an apparatus is disclosed that comprises a device encoded with logic executable by one or more processors to: identify and retrieve a plurality of overlapping attribute segments to use for comparing at least two data sources, wherein the at least two data sources includes a first data source that is more accurate than a second data source, compare each of the overlapping attribute segments, calculate a factor for each of the overlapping attribute segments, and use the factors to update a first group of values in the second data source to reduce bias.

A person of ordinary skill in the computer software art will recognize that the client and/or server arrangements, user interface screen content, and data layouts could be organized differently to include fewer or additional options or features than as portrayed in the illustrations and still be within the spirit of the invention.

While the invention has been illustrated and described in detail in the drawings and foregoing description, the same is to be considered as illustrative and not restrictive in character, it being understood that only the preferred embodiment has been shown and described and that all equivalents, changes, and modifications that come within the spirit of the inventions as described herein and/or by the following claims are desired to be protected.

1. A method comprising: identifying a plurality of data sources, wherein at least a first data source is more accurate than a second data source; identifying a plurality of overlapping attribute segments to use for comparing the data sources; calculating a factor as a function of each of the plurality of overlapping attribute segments; and using the factors to update a first group of values in the second data source to reduce bias.